CalEnviroScreen uses the annual mean concentration of PM2.5 (a weighted average of measured monitor concentrations and satellite observations, in µg/m3) over three years (2015 to 2017) as its PM2.5 indicator. According to the map, places around San Francisco Bay, such as Napa, show higher PM2.5 concentrations than other places. One possible reason is that these areas have many busy ports and transportation centers, which produce more air pollution.

Based on data collected over three years (2015 to 2017), CalEnviroScreen calculates the modeled, age-adjusted rate of emergency department (ED) visits for asthma per 10,000. The map shows that asthma rates are higher in places around San Francisco Bay, Grizzly Bay, and Suisun Bay.

If we use ggplot2's geom_smooth() function in R to draw a smoothed best-fit curve, we can see that the relationship is not linear. When the PM2.5 concentration is below 8 or above 9.5 µg/m3, asthma occurrence seems unrelated to PM2.5, but within the range of 8 to 9.5 µg/m3 the two are positively correlated.
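The kind of smoothed curve described above can be sketched with base R's `loess()` as a stand-in smoother. This is only an illustration under stated assumptions: the data frame `demo` is synthetic (it is not the CalEnviroScreen data), and the flat-rising-flat shape is built in by hand to mimic the pattern described in the text.

```r
# Synthetic stand-in for ces4_clean (hypothetical values, NOT CalEnviroScreen data).
set.seed(1)
n <- 500
pm25 <- runif(n, min = 6, max = 12)
# Flat below 8, rising between 8 and 9.5, flat above 9.5, plus noise.
asthma <- 40 + 30 * pmin(pmax(pm25 - 8, 0), 1.5) + rnorm(n, sd = 10)
demo <- data.frame(PM2.5 = pm25, Asthma = asthma)

# Local regression smoother from base R (stats package).
fit_smooth <- loess(Asthma ~ PM2.5, data = demo)

# Evaluate the fitted curve on a grid to see where it is flat and where it rises.
grid <- data.frame(PM2.5 = seq(6.5, 11.5, by = 0.5))
grid$fitted <- predict(fit_smooth, newdata = grid)
print(grid)
```

With ggplot2 installed, a comparable picture would come from `ggplot(demo, aes(PM2.5, Asthma)) + geom_point() + geom_smooth()`. Note that `geom_smooth()` chooses its smoothing method by default, so the curve in the report may come from a different smoother than the one used in this sketch.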

## 
## Call:
## lm(formula = Asthma ~ PM2.5, data = ces4_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -54.47 -25.89  -9.61  12.94 182.95 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -116.278     13.040  -8.917   <2e-16 ***
## PM2.5         19.862      1.534  12.950   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.49 on 1578 degrees of freedom
## Multiple R-squared:  0.09606,    Adjusted R-squared:  0.09549 
## F-statistic: 167.7 on 1 and 1578 DF,  p-value: < 2.2e-16

If we fit a linear regression, an increase of 1 µg/m3 in the annual mean PM2.5 concentration is associated with an increase of 19.862 in asthma ED visits per 10,000, and about 9.55% of the variation in asthma ED visits per 10,000 is explained by the variation in PM2.5 concentration (adjusted R-squared = 0.09549).
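To make the slope interpretation concrete, the short sketch below plugs the reported coefficients into the fitted line. The helper name `predict_asthma` is hypothetical and introduced only for illustration; the two coefficients are copied from the regression output.

```r
# Coefficients copied from the lm(Asthma ~ PM2.5) summary output.
intercept <- -116.278
slope     <- 19.862

# Fitted asthma ED visit rate (per 10,000) at a given PM2.5 level (µg/m3).
# predict_asthma is a hypothetical helper, not part of the original analysis.
predict_asthma <- function(pm25) intercept + slope * pm25

predict_asthma(9)                        # fitted rate at 9 µg/m3
predict_asthma(10) - predict_asthma(9)   # +1 µg/m3 changes the fit by exactly the slope

# The intercept is the fitted value at PM2.5 = 0. It is negative, so it should be
# read as an extrapolation of the line rather than as a real rate.
predict_asthma(0)
```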

If we plot the residuals of the linear regression, they are not centered on 0 (the median is -9.61) and the density curve is clearly right-skewed (the residuals range from -54.47 to 182.95), which means we cannot meaningfully interpret the regression results on the untransformed data.
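A residual check of this kind might look like the sketch below. It runs on synthetic, deliberately right-skewed data (the data frame `demo` and all its values are assumptions, not the CalEnviroScreen data). One caveat when reading such a check: for a least-squares fit with an intercept, the residual mean is zero by construction, so the median and a skewness measure are the more telling summaries.

```r
# Synthetic right-skewed response (hypothetical, NOT CalEnviroScreen data).
set.seed(42)
n <- 1000
pm25 <- runif(n, 6, 12)
asthma <- exp(1 + 0.3 * pm25 + rnorm(n, sd = 0.6))
demo <- data.frame(PM2.5 = pm25, Asthma = asthma)

fit <- lm(Asthma ~ PM2.5, data = demo)
res <- residuals(fit)

# Least squares with an intercept forces the residual mean to (numerically) zero,
# so compare it with the median and a simple moment-based skewness.
mean_res   <- mean(res)
median_res <- median(res)
skewness   <- mean((res - mean_res)^3) / sd(res)^3

c(mean = mean_res, median = median_res, skewness = skewness)
```

A density plot such as `plot(density(res))` would show the same asymmetry visually: a negative median together with positive skewness indicates a long right tail.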

## 
## Call:
## lm(formula = log(Asthma) ~ PM2.5, data = ces4_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.00402 -0.46479  0.03313  0.42298  1.75525 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.69234    0.22840   3.031  0.00248 ** 
## PM2.5        0.35633    0.02686  13.264  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6566 on 1578 degrees of freedom
## Multiple R-squared:  0.1003, Adjusted R-squared:  0.09974 
## F-statistic: 175.9 on 1 and 1578 DF,  p-value: < 2.2e-16

If we log-transform the asthma rate before applying the linear regression, the residuals are centered close to 0 (median 0.033) and their density curve is nearly symmetric, so this model is a more appropriate fit than the regression on the untransformed data.
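Because the response is on the log scale, the slope of this model is easier to read after exponentiating it. The sketch below does that arithmetic with the reported coefficient; the variable names are illustrative only.

```r
# Slope copied from the lm(log(Asthma) ~ PM2.5) summary output.
slope_log <- 0.35633

# In a log-linear model, +1 µg/m3 of PM2.5 multiplies the fitted asthma rate
# by exp(slope), i.e. a constant percentage change rather than a constant amount.
ratio <- exp(slope_log)
pct_change <- (ratio - 1) * 100
pct_change   # roughly 43 percent
```

In other words, under this model each additional 1 µg/m3 of PM2.5 is associated with a fitted asthma ED visit rate that is roughly 43% higher, instead of a fixed increase of about 20 visits per 10,000 as in the untransformed model.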